Gower as Data: Exploring the Application of Machine Learning to Gower's Middle English Corpus
Distant reading, a digital humanities method in wide use, involves processing and analyzing a large amount of text through computer programs. In treating texts as data, these methods can highlight trends in diction, themes, and linguistic patterns that individual readers may miss or critical traditions may obscure. Though several scholars have undertaken projects using topic models and text mining on Middle English texts, the nonstandard orthography of Middle English makes this process more challenging than for our counterparts in later literature.
This collaborative project uses Gower's Confessio Amantis as a small, fixed corpus for analysis. We employ natural language processing to reexamine the Confessio's themes, adding data analysis to the more traditional close reading strategies of Gower scholarship. We use Gower's work as a case study both to help reduce the potential variants across textual versions and to more deeply investigate the corpus than distant reading normally allows.
Here, we share our initial findings as well as our methodologies. We hope to share resources that will allow other scholars to engage in similar types of projects.
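The orthographic challenge described above can be made concrete. A minimal sketch of one preprocessing step, assuming a hypothetical spelling-variant table (VARIANTS, normalize, and term_frequencies are illustrative names, not the project's actual code): because Middle English orthography is nonstandard, several surface forms must be collapsed to one headword before any counting or topic modeling is meaningful.

```python
from collections import Counter
import re

# Hypothetical variant table: multiple Middle English spellings map to
# one normalized headword before frequency analysis.
VARIANTS = {
    "loue": "love",
    "lufe": "love",
    "herte": "heart",
    "hert": "heart",
}

def normalize(token: str) -> str:
    """Map a Middle English spelling variant to a normalized headword."""
    return VARIANTS.get(token, token)

def term_frequencies(text: str) -> Counter:
    """Tokenize a passage, normalize each token, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(normalize(t) for t in tokens)

freqs = term_frequencies("His herte was set on loue; on lufe his hert was set.")
```

Without such normalization, "loue" and "lufe" would count as distinct terms and dilute any trend a distant-reading method might surface.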
Rare but Severe Neural Machine Translation Errors Induced by Minimal Deletion: An Empirical Study on Chinese and English
We examine the inducement of rare but severe errors in English-Chinese and
Chinese-English in-domain neural machine translation by minimal deletion of the
source text with character-based models. By deleting a single character, we can
induce severe translation errors. We categorize these errors and compare the
results of deleting single characters and single words. We also examine the
effect of training data size on the number and types of pathological cases
induced by these minimal perturbations, finding significant variation. We find
that deleting a word hurts overall translation score more than deleting a
character, but certain errors are more likely to occur when deleting
characters, with language direction also influencing the effect.
Comment: COLING 2022 camera ready
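The perturbations themselves are simple to reproduce in outline. A minimal sketch of generating the single-character and single-word deletions described above (the function names are illustrative; the paper's actual pipeline feeds such variants to a trained character-based NMT model):

```python
def single_char_deletions(src: str):
    """Yield (index, variant) pairs, each variant missing one character of src."""
    for i in range(len(src)):
        yield i, src[:i] + src[i + 1:]

def single_word_deletions(src: str):
    """Yield (index, variant) pairs, each variant missing one whitespace-separated word."""
    words = src.split()
    for i in range(len(words)):
        yield i, " ".join(words[:i] + words[i + 1:])

char_variants = list(single_char_deletions("we examine errors"))
word_variants = list(single_word_deletions("we examine errors"))
```

Each variant differs from the source by exactly one deletion, so any severe translation error it induces can be attributed to that minimal perturbation.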
Pathologies of Neural Models Make Interpretations Difficult
One way to interpret neural model predictions is to highlight the most
important input features---for example, a heatmap visualization over the words
in an input sentence. In existing interpretation methods for NLP, a word's
importance is determined by either input perturbation---measuring the decrease
in model confidence when that word is removed---or by the gradient with respect
to that word. To understand the limitations of these methods, we use input
reduction, which iteratively removes the least important word from the input.
This exposes pathological behaviors of neural models: the remaining words
appear nonsensical to humans and are not the ones determined as important by
interpretation methods. As we confirm with human experiments, the reduced
examples lack information to support the prediction of any label, but models
still make the same predictions with high confidence. To explain these
counterintuitive results, we draw connections to adversarial examples and
confidence calibration: pathological behaviors reveal difficulties in
interpreting neural models trained with maximum likelihood. To mitigate their
deficiencies, we fine-tune the models by encouraging high entropy outputs on
reduced examples. Fine-tuned models become more interpretable under input
reduction without accuracy loss on regular examples.
Comment: EMNLP 2018 camera ready
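Input reduction as described can be sketched as a greedy loop, here with a toy model standing in for a neural classifier (predict and confidence are hypothetical stand-ins, not the paper's code):

```python
def input_reduction(tokens, predict, confidence):
    """Greedily remove the word whose removal least hurts confidence in the
    original prediction, stopping if the predicted label would change."""
    label = predict(tokens)
    while len(tokens) > 1:
        # Try removing each word; keep the candidate with the highest
        # remaining confidence in the original label.
        candidates = [tokens[:i] + tokens[i + 1:] for i in range(len(tokens))]
        best = max(candidates, key=lambda c: confidence(c, label))
        if predict(best) != label:
            break
        tokens = best
    return tokens

# Toy "model": the label depends only on whether "good" is present.
predict = lambda toks: "pos" if "good" in toks else "neg"
confidence = lambda toks, label: 1.0 if predict(toks) == label else 0.0

reduced = input_reduction(["this", "movie", "is", "good"], predict, confidence)
```

With a real neural model, the reduced inputs that survive this loop often look nonsensical to humans while the model keeps its original high-confidence prediction, which is the pathology the paper documents.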
Don't Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation
We introduce a reinforcement learning-based approach to simultaneous machine translation (producing a translation while receiving input words) between languages with drastically different word orders: from verb-final languages (e.g., German) to verb-medial languages (English). In traditional machine translation, a translator must "wait" for source material to appear before translation begins. We remove this bottleneck by predicting the final verb in advance. We use reinforcement learning to learn when to trust predictions about unseen, future portions of the sentence. We also introduce an evaluation metric to measure expeditiousness and quality. We show that our new translation model outperforms batch and monotone translation strategies.
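The decision process the abstract describes can be caricatured as an incremental loop over actions; everything below (the action names, policy, and predict_verb) is a hypothetical illustration of the idea, not the paper's implementation:

```python
def simultaneous_translate(source_tokens, policy, predict_verb):
    """Consume source tokens one at a time; at each step the learned policy
    chooses to WAIT for more input, COMMIT output for what it has seen, or
    PREDICT the still-unseen final verb and translate early."""
    seen, output = [], []
    for tok in source_tokens:
        seen.append(tok)
        action = policy(seen)
        if action == "PREDICT":
            output.append(predict_verb(seen))  # gamble on the future verb
        elif action == "COMMIT":
            output.append(tok)
        # WAIT: emit nothing yet, keep reading the source

    return output

# A trivial policy that always commits reduces to monotone translation.
out = simultaneous_translate(["a", "b", "c"], lambda s: "COMMIT", lambda s: "?")
```

The reinforcement learning in the paper is about training that policy: rewarding it for committing early only when its verb predictions turn out to be trustworthy.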
Syntax-based Rewriting for Simultaneous Machine Translation
Divergent word order between languages causes delay in simultaneous machine translation. We present a sentence rewriting method that generates more monotonic translations to improve the speed-accuracy tradeoff. We design grammaticality- and meaning-preserving syntactic transformation rules that operate on constituent parse trees. We apply the rules to reference translations to make their word order closer to the source language word order. On Japanese-English translation (two languages with substantially different structure), incorporating the rewritten, more monotonic reference translation into a phrase-based machine translation system enables better translations faster than a baseline system that only uses gold reference translations.
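One such rule can be sketched on a toy constituent tree; the rule below, which pushes a verb-medial English clause toward verb-final order to mirror a Japanese SOV source, is a hypothetical illustration rather than one of the paper's actual transformations:

```python
def rewrite(tree):
    """Recursively apply a toy rule to a tree of (label, children...) tuples:
    S -> NP V NP becomes S -> NP NP V, moving the verb to the end so the
    English order is more monotonic with an SOV source like Japanese."""
    if isinstance(tree, str):
        return tree  # leaf word
    label, *children = tree
    children = [rewrite(c) for c in children]
    if label == "S" and len(children) == 3 and children[1][0] == "V":
        children = [children[0], children[2], children[1]]
    return (label, *children)

parse = ("S", ("NP", "she"), ("V", "reads"), ("NP", "books"))
reordered = rewrite(parse)
```

Applying such rules to reference translations, as the paper does, yields training references whose word order tracks the source more closely, so a simultaneous system can begin emitting output sooner.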